Methods and apparatus for natural spoken language speech recognition with word prediction

ABSTRACT

A word prediction method and apparatus improves precision and accuracy. For the prediction of a sixth word “?”, a partial analysis tree having a modification relationship with the sixth word is predicted. “sara-ni sho-senkyoku no” has two partial analysis trees, “sara-ni” and “sho-senkyoku no”. It is predicted that “sara-ni” does not have a modification relationship with the sixth word, and that “sho-senkyoku no” does. Then, “donyu”, which is the sixth word from “sho-senkyoku no”, is predicted. In this example, since “sara-ni” is not useful information for the prediction of “donyu”, it is preferable that “donyu” be predicted only by “sho-senkyoku no”.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a divisional application of U.S. patent application Ser. No. 09/904,147, filed on Jul. 11, 2001, now U.S. Pat. No. 7,359,852, which claims priority from Japanese Patent Application No. 2000-210599, filed on Jul. 11, 2000, both of which are incorporated herein by reference in their entireties.

FIELD OF THE INVENTION

The present invention relates to a speech recognition apparatus and a method therefor, and in particular to a speech recognition apparatus for recognizing natural language spoken by persons, which is thereafter used for composing sentences and for creating text data, and a method therefor.

BACKGROUND OF THE INVENTION

A statistical method for using an acoustic model and a language model for speech recognition is well known, and has been featured in such publications as: “A Maximum Likelihood Approach to Continuous Speech Recognition,” L. R. Bahl, et al., IEEE Trans. Vol. PAMI-5, No. 2, March, 1983; and “Word based approach to large-vocabulary continuous speech recognition for Japanese,” Nishimura, et al., Information Processing Institute Thesis, Vol. 40, No. 4, April, 1999.

According to an overview of this method, a word sequence W is voiced as a generated sentence and is processed by an acoustic processor, and a feature value X is extracted from the signal that is produced. Then, using the feature value X and the word sequence W, the assumed optimal recognition result W′ is output in accordance with the following equation to form a sentence. That is, the word sequence W for which the product of the appearance probability P(X|W) of the feature value X when the word sequence W is voiced and the appearance probability P(W) of the word sequence W is the maximum (argmax) is selected as the recognition result W′.

$W' = \arg\max_{W} P(W \mid X) = \arg\max_{W} P(W)\,P(X \mid W) \qquad [\text{Equation 1}]$
where P(W) is given by a language model, and P(X|W) is given by an acoustic model.

In this equation, the acoustic model is employed to obtain the probability P(X|W), and words having a high probability are selected as proposed words for recognition. The language model is frequently used to provide an approximation of the probability P(W).
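The selection rule of Equation 1 can be illustrated with a minimal Python sketch. The candidate word sequences and all probability values below are invented for illustration only, and the function name `decode` is hypothetical; the sketch simply picks the candidate that maximizes log P(W) + log P(X|W).

```python
import math

def decode(candidates, acoustic_logprob, language_logprob):
    """Return the candidate W that maximizes log P(W) + log P(X|W)."""
    return max(candidates,
               key=lambda w: language_logprob[w] + acoustic_logprob[w])

# Two acoustically similar candidates (all scores are made-up values).
cand_no = ("sho", "senkyo", "ku", "no")
cand_wo = ("sho", "senkyo", "ku", "wo")
candidates = [cand_no, cand_wo]

# log P(X|W): the acoustic model slightly prefers "wo" in this toy setup.
acoustic_logprob = {cand_no: math.log(0.40), cand_wo: math.log(0.45)}
# log P(W): the language model prefers "no".
language_logprob = {cand_no: math.log(0.03), cand_wo: math.log(0.02)}

best = decode(candidates, acoustic_logprob, language_logprob)
print(best)
```

Working in log probabilities is a common design choice because it turns the product into a sum and avoids numerical underflow for long word sequences.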

For the conventional language model, normally, the closest word sequence is used as a history. An example is an N-gram model. With this method, the appearance probability of a complete sentence, i.e., the appearance probability of the word sequence W, is approximated by using the appearance probabilities of N sequential words. This method is exemplified by the following established form.

$\begin{aligned} P(W) &= P(w_0)\,P(w_1 \mid w_0)\,P(w_2 \mid w_0 w_1) \cdots P(w_n \mid w_0 w_1 \cdots w_{n-1}) \\ &= P(w_0)\,P(w_1 \mid w_0) \prod_{i=2}^{n} P(w_i \mid w_{i-2} w_{i-1}) \end{aligned} \qquad [\text{Equation 2}]$

Assume that in the above equation the appearance probability of the next word W[n] is affected only by the immediately preceding N−1 words. Various values can be used for N, but N=3 is frequently employed because of the balance it provides between effectiveness and the learning data that is required. In this equation, N=3 is employed, and the above method is therefore called a tri-gram or a 3-gram method. Hereinafter, when the n-th word in a word sequence W consisting of n words is represented by W[n], the appearance probability of the word W[n] is conditioned on the N−1 preceding words (two words), i.e., the appearance probability for the word sequence W is calculated using P(W[n]|W[n−2]W[n−1]). In this expression, the term to the left of “|” (W[n]) represents a word to be predicted (or recognized), and the term to the right (W[n−2]W[n−1]) represents the first and the second preceding words required to establish the condition. This appearance probability P(W[n]|W[n−2]W[n−1]) is learned for each word W[n] by using text data that have previously been prepared, and is stored as part of a dictionary database. For example, for the probability that a “word” will appear at the beginning of a sentence, 0.0021 is stored, and for the probability a “search” will follow, 0.001 is stored.
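As a rough illustration of how P(W[n]|W[n−2]W[n−1]) might be learned from text data, the following Python sketch estimates tri-gram probabilities by relative frequency. The toy corpus, the token “kaisei”, the sentence-start padding symbol `<s>`, and the function names are assumptions made for this example and are not taken from the description above; smoothing of unseen events is omitted.

```python
from collections import Counter

def train_trigram(sentences):
    """Count tri-grams and their two-word histories in tokenized sentences."""
    tri, hist = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words)  # padding is an assumption
        for i in range(2, len(padded)):
            history = (padded[i - 2], padded[i - 1])
            hist[history] += 1
            tri[history + (padded[i],)] += 1
    return tri, hist

def trigram_prob(tri, hist, w2, w1, w):
    """Relative-frequency estimate of P(w | w2 w1); 0.0 for unseen histories."""
    n = hist[(w2, w1)]
    return tri[(w2, w1, w)] / n if n else 0.0

# Toy corpus: "ku no" is followed by "donyu" twice and by "kaisei" once.
corpus = [["ku", "no", "donyu"], ["ku", "no", "donyu"], ["ku", "no", "kaisei"]]
tri, hist = train_trigram(corpus)
p = trigram_prob(tri, hist, "ku", "no", "donyu")
print(p)
```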

The tri-gram model will now be described by using a simple phrase. This phrase is “sara-ni sho-senkyoku no (further, small electoral districts)” and is used to predict the following “donyu (are introduced)”. FIG. 8A is a diagram showing the state before the prediction is fulfilled, and FIG. 8B is a diagram showing the state after the prediction is fulfilled. As is shown in FIG. 8A, the phrase consists of five words, “sara-ni”, “sho”, “senkyo”, “ku” and “no”, while the predicted word is represented by “?”, and the arrows in FIGS. 8A and 8B are used to delineate the modifications applied to the words. As previously described, in the tri-gram model, two preceding words are constantly employed to predict a following word. Therefore, in this example, “donyu” is predicted by “ku” and “no”, words enclosed by solid lines in FIG. 8A.

However, depending on the sentence structure, the tri-gram method of employing the two immediately preceding words to predict a following word is not the most appropriate. For example, the tri-gram method is not appropriate for the case illustrated in FIG. 9, wherein the phrase “nani-ga ima seiji-no saisei-no tame-ni (at present, for reconstruction of the politics, what)” is used to predict a word. According to the tri-gram method, as is shown in FIG. 9A, “tame” and “ni” are employed to predict “hitsuyo (is required)”. But in addition to these words, other structurally related words, such as “nani” or “ima”, must be taken into account in order to increase the accuracy of the prediction.

Chelba and Jelinek proposed a model for employing the head words of the two immediately preceding partial analysis trees to predict a succeeding word. According to the Chelba and Jelinek model, the words are predicted in order, as they appear. Therefore, when the i-th word is to be predicted, the words up to the (i−1)th word and their structure are established. In this state, first, the head words of the two immediately preceding partial analysis trees are employed to predict, in the named order, the following word and its part of speech. At this time, the modification relationship between the head words of the two immediately preceding partial analysis trees and the predicted word is not taken into account. After the word is predicted, the sentence structure that includes the word is updated. Therefore, the accuracy of the prediction can be improved compared with the tri-gram method, which employs two immediately preceding words to predict a following word. However, in the model proposed by Chelba and Jelinek, a word is predicted by referring to the head words of the two immediately preceding partial analysis trees, regardless of how the words are modified, so that, depending on the sentence structure, the accuracy of the prediction may be reduced. This will be explained by referring to the phrase “sara-ni sho-senkyoku no”, used for the tri-gram model.

As is shown in FIGS. 10A to 10C, the phrase “sara-ni sho-senkyoku no” is constituted by two partial analysis trees, and the head words of the trees are “sara-ni” and “no”, which are enclosed by solid lines in FIG. 10A. Therefore, according to the method proposed by Chelba and Jelinek, “sara-ni” and “no”, which are the two immediately preceding head words as is shown in FIG. 10B, are employed to predict the next word “donyu”. When “donyu” is predicted, as is shown in FIG. 10C, the sentence structure including “donyu” is predicted. In the prediction of the structure, the modification of words as indicated by arrows is included. Since “sara-ni” does not modify “donyu”, it is not only useless for the prediction of the word “donyu”, but also may tend to degrade the prediction accuracy.

For the phrase “nani-ga ima seiji-no saisei-no tame-ni”, in FIG. 11, the following prediction process is performed. This phrase is constituted by three partial analysis trees, “nani-ga”, “ima” and “seiji-no saisei-no tame-ni”, and the head words of the trees are “ga”, “ima” and “ni”. As indicated by the solid line enclosures in FIG. 11A, the two immediately preceding head words are “ima” and “ni”. Therefore, as is shown in FIG. 11B, “hitsuyo” is predicted by using “ima” and “ni”. And after “hitsuyo” is predicted, the sentence structure that includes “hitsuyo” is predicted, as is shown in FIG. 11C.

To predict a word, the modifications performed by words provide useful information. However, the fact that “nani-ga” is a modifier is not taken into account. As is described above, according to the method proposed by Chelba and Jelinek, it frequently occurs that information that is useful for prediction is given no consideration.
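The difference between the two conventional histories can be made concrete with a small Python sketch. The data layout (a list of (words, head word) pairs) and the function names are hypothetical; the words, partial analysis trees, and head words are those given above for the phrase “nani-ga ima seiji-no saisei-no tame-ni”.

```python
def trigram_context(words):
    """Tri-gram history: the two immediately preceding words."""
    return tuple(words[-2:])

def head_word_context(trees):
    """Chelba and Jelinek style history: the head words of the two
    immediately preceding partial analysis trees."""
    return tuple(head for _, head in trees[-2:])

words = ["nani", "ga", "ima", "seiji", "no", "saisei", "no", "tame", "ni"]
trees = [
    (["nani", "ga"], "ga"),
    (["ima"], "ima"),
    (["seiji", "no", "saisei", "no", "tame", "ni"], "ni"),
]

trigram_ctx = trigram_context(words)
head_ctx = head_word_context(trees)
print(trigram_ctx, head_ctx)
```

In both cases the tree “nani-ga” is outside the history, which is the limitation discussed above.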

A need therefore exists for a word prediction method that supplies improved prediction accuracy, and for a speech recognition method therefor. The following will provide a brief summary of the invention.

SUMMARY OF THE INVENTION

The present invention focuses on the fact that, at each word prediction step, a sequence of partial analysis trees covering currently obtained word sequences can be employed as historical information. A partial analysis tree sequence, when used as historical information, can be employed to select a partial analysis tree carrying information that can more usefully be employed for the prediction of the next word. In essence, when a word sequence employed as history and a modification structure are used to select the most useful word and/or word sequence for predicting the next word, prediction accuracy can be improved. That is, after a partial analysis tree that includes a modification function for a word to be predicted is specified, this partial analysis tree, i.e., a word and/or a word sequence that is estimated to have a modification relationship with a word that is to be predicted, is employed for the prediction of the following word. Unlike the method proposed by Chelba and Jelinek, since the structure of a sentence, to include the word to be predicted, is employed, only information that is useful for prediction will be taken into account.

Based on the above described idea, according to the present invention, a speech recognition method is provided, said method comprising the steps of: specifying a structure of a phrase from a beginning of the phrase to a j-th word, wherein j=0, 1, 2, . . . ; employing a sentence structure up to said j-th word to specify one or multiple partial analysis trees modifying the (j+1)th word; predicting said (j+1)th word based on said one or multiple partial analysis trees; obtaining a putative sentence structure for the phrase including the predicted (j+1)th word and a probability value for said putative sentence structure; when the above steps have been performed up to the last word of said sentence, selecting as speech recognition results a sentence structure and a word sequence having maximum probability values; and returning said speech recognition results to a user.

According to the present invention, a speech recognition apparatus is provided, said apparatus comprising: an arrangement adapted to specify a structure of a phrase from a beginning of the phrase to a j-th word, wherein j=0, 1, 2, . . . ; an arrangement adapted to employ a sentence structure up to said j-th word to specify one or multiple partial analysis trees modifying the (j+1)th word; an arrangement adapted to predict said (j+1)th word based on said one or multiple partial analysis trees and to obtain a putative sentence structure for the phrase including the predicted (j+1)th word and a probability value for said putative sentence structure; an arrangement adapted to select, when the above steps have been performed up to the last word of said sentence, as speech recognition results a sentence structure and a word sequence having maximum probability values; and an arrangement adapted to return said speech recognition results to a user.

The present invention also provides a program storage device readable by computer, tangibly embodying a program of instructions executable by the computer to perform method steps for speech recognition, said method comprising the steps of: specifying a structure of a phrase from a beginning of the phrase to a j-th word, wherein j=0, 1, 2, . . . ; employing a sentence structure up to said j-th word to specify one or multiple partial analysis trees modifying the (j+1)th word; predicting said (j+1)th word based on said one or multiple partial analysis trees; obtaining a putative sentence structure for the phrase including the predicted (j+1)th word and a probability value for said putative sentence structure; when the above steps have been performed up to the last word of said sentence, selecting as speech recognition results a sentence structure and a word sequence having maximum probability values; and returning said speech recognition results to a user.

For a better understanding of the present invention, together with other and further features and advantages thereof, reference is made to the following description, taken in conjunction with the accompanying drawings, and the scope of the invention that will be pointed out in the appended claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram for the embodiment.

FIG. 2 is a diagram showing the configuration of a computer system according to the embodiment.

FIG. 3 is a diagram for explaining word prediction according to the embodiment.

FIG. 4 is a diagram for explaining an example of word prediction according to the embodiment.

FIG. 5 is a flowchart for explaining speech recognition according to the embodiment.

FIG. 6 is a diagram showing another example for explaining word prediction according to the embodiment.

FIG. 7 is a diagram showing an additional example for explaining word prediction according to the embodiment.

FIG. 8 is a diagram showing an example for explaining word prediction using a tri-gram model.

FIG. 9 is a diagram showing another example for explaining word prediction using a tri-gram model.

FIG. 10 is a diagram showing an example for explaining word prediction using the method proposed by Chelba and Jelinek.

FIG. 11 is a diagram showing another example for explaining word prediction using the method proposed by Chelba and Jelinek.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The preferred embodiment of the present invention will now be described. It is to be understood that the present invention, in accordance with at least one presently preferred embodiment, includes method steps (e.g. speech recognition) that may be employed by elements that may be implemented on at least one general-purpose computer running suitable software programs. These may also be implemented on at least one Integrated Circuit or part of at least one Integrated Circuit. Thus, it is to be understood that the invention may be implemented in hardware, software, or a combination of both.

It will also be readily understood that the present invention, as generally described and illustrated in the Figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of the embodiments of the methods of the present invention, as represented in the Figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. Thus, although illustrative embodiments of the present invention have been described herein with reference to the accompanying Figures, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be effected therein by one skilled in the art without departing from the scope or spirit of the invention.

FIG. 1 is a block diagram illustrating the configuration including a speech recognition apparatus, according to an embodiment. A word sequence W, generated as a sentence (a true sentence) by a block 101, is released as S (block 102). The released S is transmitted to an acoustic processor 111 constituting speech recognition means 110. The acoustic processor 111 converts the input S into a signal X, which it stores. The signal X is changed into a recognition result W′ by a language decoder 112, which includes an acoustic model 113 that has learned the feature of a sound and a dictionary 114 in which text data prepared in advance through learning is stored. The sentence for the result W′ is subsequently displayed (block 120).

FIG. 2 is a diagram showing an example system for the employment of the speech recognition method according to one embodiment. This system comprises a microphone 210, a computer 220 and a display device 230. The computer 220 includes a sound card 221, a memory 222 and a CPU 223. In the system in FIG. 2, a speaker's speech is received as an analog signal by the microphone 210. Thereafter, the sound card 221 of the computer 220 converts the analog signal into a digital signal that is stored in the memory 222. The acoustic model 113 and the dictionary 114, including the language model, are also stored in the memory 222. Subsequently, the CPU 223 decodes the language based on the digital signal and the dictionary 114 stored in the memory 222, and also interprets and executes a program for implementing a word prediction method that will be described later. The obtained language is the recognition result, and is displayed on the display device 230. This program is stored in the memory 222.

In this system, the microphone 210 is a member separate from the computer 220, but it may be integrally formed with the computer 220 or the display device 230. In other words, so long as a microphone for converting speech into equivalent electrical energies is provided, any form can be employed. Furthermore, the recognition result is displayed on the display device 230, e.g., a CRT; however, the result can also be transferred to and printed by a printer, or it can be stored on a flexible disk or another storage medium.

In one embodiment, as an assumption for the word prediction method that will be described below, proposed words are selected as the result of calculations that use the acoustic model 113 for the speech signal obtained by the acoustic processor 111. The following word prediction process is performed for these selected words, and the speech recognition results are finally obtained.

The word prediction method for use with the exemplary system will now be described. As is described above, according to the present invention it is proposed that a partial analysis tree that has a modification relationship with a word to be predicted is predicted, and then the partial analysis tree, i.e., a preceding word and/or word sequence that is estimated to be related to the next word, is employed to predict the next word. In other words, the next word is predicted by using the partial analysis tree that has a modification relationship with the word to be predicted.

This embodiment will be explained based on the example phrase “sara-ni sho-senkyoku no”, which was used for explaining the tri-gram method and the method proposed by Chelba and Jelinek. The phrase “sara-ni sho-senkyoku no” comprises the five words “sara-ni”, “sho”, “senkyo”, “ku” and “no”. Assuming “j” is used to represent the position of a word measured from the beginning of the phrase, “no” is the fifth word. Further, as shown in FIGS. 3A to 3C, it is assumed that there are three structure types for the phrase that includes the fifth word “no”. The sentence structure in this case represents the modification relationship among the words. The three structures will now be described.

In FIG. 3A, while “sara-ni” does not modify “no”, “no” is modified by “ku”. This state is shown by using arrows; the arrow from “sara-ni” points to the word following “no”, while the arrow from “ku” points to “no”. Since “sara-ni” forms a partial analysis tree and “sho-senkyoku no” forms another partial analysis tree, in the example in FIG. 3A the only partial analysis tree related to “no” is “sho-senkyoku no”. It should be noted that the probability value for this structure is defined as 0.034.

In FIG. 3B, neither “sara-ni” nor “ku” modifies “no”. Therefore, the arrows from “sara-ni” and “ku” point to words following “no”. The probability value for this sentence structure is defined as 0.001.

In FIG. 3C, instead of the “no” in FIGS. 3A and 3B, the use of “wo”, which has a similar sound, is predicted. The prediction of “wo” is proposed by the acoustic model 113. The sentence structure is the same as in FIG. 3A: “sara-ni” does not modify the fifth word, while the fifth word is modified by “ku”. The probability value for the case in FIG. 3C is defined as 0.028. Since the probability value in FIG. 3A is the highest, at this time the case represented by FIG. 3A, which has the maximum probability value, can be the proposed result for speech recognition.

The cases in FIGS. 3A to 3C are merely examples used for explaining an embodiment. For example, when the fifth word is “wo”, the same case as in FIG. 3B may be present, or a case where the fifth word is “to” instead of “no” or “wo” may be present. In any case, in FIGS. 3A to 3C, the structure, including the j-th (fifth) word, and the probability value are shown. It should be noted that the statement s[5][0] in FIG. 3 indicates that the fifth word is a target to be processed, and [0] means one of the words having a modification relationship is a target for the process.

Then, the sixth word is predicted. For this prediction, first, the sentence structure, including the sixth word, is specified. For the example in FIG. 3A, there are three available cases: a case where only “no” modifies the sixth word; a case where both “sara-ni” and “no” modify the sixth word; and a case where “sara-ni” and “no” do not modify the sixth word. The sixth word is predicted for the respective three cases. These three cases are shown in FIGS. 3(a-1) through 3(a-3). In this embodiment, before the sixth word is predicted, the sentence structure, including the sixth word, is specified.
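One way to enumerate the candidate cases is sketched below in Python. The sketch assumes, beyond what is stated above, that the partial analysis trees modifying the next word always form a contiguous run at the end of the tree sequence (so that modification arrows do not cross); under that assumption the three cases listed for FIG. 3A are reproduced. The function name is hypothetical.

```python
def candidate_modifier_sets(trees):
    """Return every suffix of the tree sequence, from the empty suffix
    (no tree modifies the next word) to the full sequence (all do)."""
    return [tuple(trees[k:]) for k in range(len(trees), -1, -1)]

trees = ["sara-ni", "sho-senkyoku no"]
cases = candidate_modifier_sets(trees)
for case in cases:
    print(case)
```

With two trees this yields the empty set, the set containing only “sho-senkyoku no” (whose head word is “no”), and the set containing both trees.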

In the dictionary 114, the appearance frequency of a predetermined word relative to another predetermined word and/or word sequence is written based on text data that has been learned. For example, assuming that the word sequence “sho-senkyoku no” has appeared in the text data n times and has been followed by “donyu” m times, the appearance frequency for “donyu” relative to “sho-senkyoku no” is m/n. When two partial analysis trees of “sara-ni” and “sho-senkyoku no” are employed to predict “donyu”, the frequency whereat “donyu” appears after “sara-ni” and “sho-senkyoku no” must be taken into account. That is, assuming that, in the text data, a sentence including “sara-ni” and “sho-senkyoku no” appeared n′ times and thereafter the word “donyu” appeared m′ times, the appearance probability for “donyu” relative to “sara-ni” and “sho-senkyoku no” is m′/n′. At this time, according to the empirical rule, very frequently “sara-ni” will modify a declinable word, such as a verb or an adjective, and will seldom modify an indeclinable word, such as a noun. Thus, since the appearance frequency m′ of the noun “donyu” is very small, the probability value when “donyu” is predicted by using two partial analysis trees “sara-ni” and “sho-senkyoku no” is considerably smaller than the probability value obtained when “donyu” is predicted merely by using “sho-senkyoku no”. In other words, it is not preferable for “sara-ni” to be taken into account for the prediction of “donyu”.
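The m/n and m′/n′ estimates can be sketched in Python as relative frequencies keyed by the partial analysis trees used as context. The class name, the placeholder token `<other>`, and all counts are invented for illustration; they are chosen only so that the two-tree context gives a much smaller probability for “donyu” than the single-tree context, as argued above.

```python
from collections import Counter

class ContextModel:
    """Relative-frequency model: P(next word | context trees) = m / n."""

    def __init__(self):
        self.context_count = Counter()  # n: times the context was seen
        self.joint_count = Counter()    # m: times the word followed it

    def observe(self, context_trees, next_word, times=1):
        ctx = tuple(context_trees)
        self.context_count[ctx] += times
        self.joint_count[(ctx, next_word)] += times

    def prob(self, context_trees, next_word):
        ctx = tuple(context_trees)
        n = self.context_count[ctx]
        return self.joint_count[(ctx, next_word)] / n if n else 0.0

model = ContextModel()
# Made-up counts: n = 4, m = 3 for the single-tree context.
model.observe(["sho-senkyoku no"], "donyu", times=3)
model.observe(["sho-senkyoku no"], "<other>", times=1)
# Made-up counts: n' = 10, m' = 1 for the two-tree context.
model.observe(["sara-ni", "sho-senkyoku no"], "donyu", times=1)
model.observe(["sara-ni", "sho-senkyoku no"], "<other>", times=9)

p_single = model.prob(["sho-senkyoku no"], "donyu")
p_both = model.prob(["sara-ni", "sho-senkyoku no"], "donyu")
print(p_single, p_both)
```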

Therefore, when “no” is used to predict “donyu”, the probability value for the phrase “sara-ni sho-senkyoku no donyu” is greater than the probability value for this sentence when “sara-ni” and “no” are employed to predict “donyu”.

In this embodiment, FIGS. 3(a-1) and 3(a-2) have been explained, and the probability value is calculated in the same manner for the case in FIG. 3(a-3). Further, the prediction process is performed in the same manner, up to the last word of the sentence.

The word prediction processing for the case in FIG. 3(a-1) will now be described while referring to FIGS. 4A to 4C. In FIG. 4A, the state in FIG. 3(a-1) is shown. In this state, a partial analysis tree having a modification relationship with the next word “?” (the sixth word in this case) is specified. In this case, the partial analysis tree “sho-senkyoku no” modifies the sixth word, while the sixth word is not modified by the partial analysis tree “sara-ni”. This modification is shown in FIG. 4B. That is, the arrow from “sara-ni” points to a word following the sixth word, and indicates that no modification has been established between the sixth word and “sara-ni”. The arrow from “no” in “sho-senkyoku no” points to the sixth word “?”, and indicates that the word sequence “sho-senkyoku no” modifies the sixth word.

As is described above, after the sentence structure, including the sixth word, has been predicted, “donyu” is predicted using the partial analysis tree “sho-senkyoku no”, which has a modification relationship with the sixth word. Further, after the prediction of “donyu”, as is shown in FIG. 4C, the sentence structure, to include “donyu”, is predicted. In other words, according to the case in FIG. 3(a-1), since “sara-ni”, which probably reduces the prediction accuracy, is not taken into account, a high probability value can be obtained.

The word prediction method for this embodiment has been explained. Next, the processing for finally outputting the speech recognition results will be explained while referring to the flowchart in FIG. 5. According to this processing, as previously described, proposed words are selected as the results of calculations using the acoustic model 113 for the speech signal acquired by the acoustic processor 111, and the narrowing of the selected words is further performed by the prediction.

In FIG. 5, which word is to be processed (S100) and which structure is to be processed (S101) are determined. The position of a word to be processed is represented by using “j”, and a structure to be processed is represented by “i”. Since the prediction is performed starting at the beginning of the sentence, the initial values of j and i are 0. The specific form of j and i can be easily understood by referring to FIG. 3.

Then, the structure of a sentence, including a word to be predicted, and its probability value are obtained (S102). In FIG. 5, s[j][ ] at S104 represents the sentence structure that includes the j-th word and the probability value. In the example in FIG. 3, first, s[5][0], i.e., the first sentence structure of the three, and its probability value are obtained for the fifth word. Since this sentence structure and the probability value are employed for the prediction of the next word, these are enumerated relative to s[j+1][ ] (S102). In the example in FIG. 3, first, FIG. 3(a-1) is enumerated for s[6][ ].

When there are multiple sentence structures, the process at S102 is performed for all of them. To do this, the process at S103, where i=i+1, and the process at S104, for determining whether all s[j][ ] are examined, are performed.

When the process at S102 has been completed for all the structures, the same process is performed for the next word, which is defined as j=j+1 (S105). When the word at the updated position j is not the last word of the sentence, the process sequence from S101 is performed. When it is the last word, the sentence structure and the word sequence having the maximum probability value are selected from s[j][ ], and are displayed on the display device 230. This sentence structure can be displayed by using arrows to indicate modifications, or as a partial analysis tree structure.
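The loop over word positions j and structures i can be sketched in Python as follows. This is a simplified illustration under several assumptions that are not part of the description: a structure is represented merely as a tuple of words rather than as a set of modification arrows, the callback `extend` (a hypothetical name) stands in for step S102 and returns each extended structure with a conditional probability, and all probability values in the toy table are invented. The step labels in the comments are approximate correspondences to FIG. 5.

```python
def recognize(num_words, extend):
    """Enumerate structures word by word and return the most probable one."""
    s = [[((), 1.0)]]                    # s[0]: empty structure, probability 1
    for j in range(num_words):           # S100 / S105: word position j
        s.append([])
        for i in range(len(s[j])):       # S101 / S103 / S104: structure i
            structure, prob = s[j][i]
            # S102: enumerate structures including the next word into s[j+1]
            for next_structure, cond in extend(structure, j):
                s[j + 1].append((next_structure, prob * cond))
    return max(s[num_words], key=lambda h: h[1])

def toy_extend(structure, j):
    """Toy stand-in for word and structure prediction (made-up values)."""
    table = {
        (): [("ku", 1.0)],
        ("ku",): [("no", 0.6), ("wo", 0.4)],
        ("ku", "no"): [("donyu", 0.7), ("<other>", 0.3)],
        ("ku", "wo"): [("donyu", 0.2), ("<other>", 0.8)],
    }
    return [(structure + (w,), p) for w, p in table[structure]]

best_structure, best_prob = recognize(3, toy_extend)
print(best_structure, best_prob)
```

A practical decoder would typically prune low-probability entries of s[j][ ] at each position; that refinement is omitted here.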

In the above embodiment, the present invention is carried out on a personal computer. However, the present invention can be provided as a storage medium in which a predetermined program is stored, or a transmission apparatus for transmitting a program.

The present invention will now be described based on an example phrase “nani-ga ima seiji-no saisei-no tame-ni”. The phrase “nani-ga ima seiji-no saisei-no tame-ni” consists of nine words, “nani”, “ga”, “ima”, “seiji”, “no”, “saisei”, “no”, “tame” and “ni”, and is constituted by three partial analysis trees “nani-ga”, “ima” and “seiji-no saisei-no tame-ni”.

In the state in FIG. 6A, the word prediction up to “nani-ga ima seiji-no saisei-no tame-ni” is completed. As is described above, this phrase is formed of three partial analysis trees, “nani-ga”, “ima” and “seiji-no saisei-no tame-ni”. As for the partial analysis tree “nani-ga”, it has been predicted that “nani” modifies “ga”. However, the word modified by the partial analysis tree “nani-ga” is unknown. This state is understood because the arrow from “ga” in FIG. 6A points to “?”. Further, the words modified by the partial analysis trees “ima” and “seiji-no saisei-no tame-ni” are also unknown.

Based on the state in FIG. 6A, the partial analysis tree that modifies the next word (the tenth word in this example) is predicted. In this example phrase, it is predicted or specified that all of the three partial analysis trees, “nani-ga”, “ima” and “seiji-no saisei-no tame-ni”, modify the tenth word. This modification is shown in FIG. 6B. That is, the arrows from “ga” in “nani-ga”, “ima”, and “ni” in “seiji-no saisei-no tame-ni” point to the tenth word.

As is described above, when the sentence structure, to include the tenth word, has been specified, the tenth word is predicted. That is, since all three partial analysis trees (“nani-ga”, “ima” and “seiji-no saisei-no tame-ni”) modify the word to be predicted, all of these are considered to predict “hitsuyo”.

According to the method proposed by Chelba and Jelinek, “hitsuyo” is predicted using the head words “ima” and “ni”. In this embodiment, “nani-ga”, which is useful information for predicting “hitsuyo”, is also employed, so the prediction accuracy in this embodiment is higher.

Up to now, Japanese phrases have been employed as examples. An explanation will now be given using an English phrase. One of the differences between Japanese and English is that the direction of the modification in Japanese is constant, whereas it is not in English. When this embodiment is used for a language, such as English, where the direction of modification is not constant, only a partial analysis tree having a modification relationship with the next word and the direction of the modification need be specified, and the partial analysis tree having the modification relationship need only be employed to predict the next word.

Assume as an example that “after” is predicted from “the contact ended with a loss”. The phrase “the contact ended with a loss” consists of six words, “the”, “contact”, “ended”, “with”, “a” and “loss”. Further, “the contact” forms one partial analysis tree, and “ended with a loss” forms another partial analysis tree.

FIG. 7A is a diagram showing the state wherein the prediction of words up to “the contact ended with a loss” is completed. As is described above, this phrase consists of two partial analysis trees “the contact” and “ended with a loss”. As indicated by arrows in FIG. 7A, “the” in the partial analysis tree “the contact” modifies “contact”. In the partial analysis tree “ended with a loss”, “a” modifies “loss”, “loss” modifies “with” and “with” modifies “ended”. As is described above, the modification in English has two directions: from front to rear and from rear to front.

Based on the state in FIG. 7A, the partial analysis tree related to the next word “?” (the seventh word in this case) is predicted. In other words, it is predicted that, as is shown in FIG. 7B, the seventh word modifies “ended”. Since “ended” is included in the partial analysis tree “ended with a loss”, the seventh word is predicted based on the modification relationship with “ended with a loss”. Then, as is shown in FIG. 7C, “after” is predicted from the partial analysis tree “ended with a loss”.
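The English example can be sketched as follows, assuming the state of FIG. 7A is stored as a list of head indices (one entry per word, `None` for a word that modifies nothing yet). The function names and this representation are hypothetical and are not taken from the embodiment. The sketch only groups the words into partial analysis trees and selects the tree containing the predicted modification target; it does not model the modification direction or the word probabilities.

```python
from __future__ import annotations


def partial_trees(head: list[int | None]) -> list[list[int]]:
    """Group word indices into partial analysis trees by following head links."""
    def root(i: int) -> int:
        while head[i] is not None:
            i = head[i]
        return i

    groups: dict[int, list[int]] = {}
    for i in range(len(head)):
        groups.setdefault(root(i), []).append(i)
    return [groups[r] for r in sorted(groups)]


def context_tree(words: list[str], head: list[int | None], target: int) -> list[str]:
    """Return the words of the partial tree that contains the target word."""
    for tree in partial_trees(head):
        if target in tree:
            return [words[i] for i in tree]
    raise ValueError("target index is outside the phrase")


words = ["the", "contact", "ended", "with", "a", "loss"]
# the -> contact, a -> loss, loss -> with, with -> ended (arrows of FIG. 7A)
head = [1, None, None, 2, 5, 3]

# Assumed output of the structure-prediction step: the seventh word modifies "ended".
target = words.index("ended")
print(context_tree(words, head, target))
```

Only the words of “ended with a loss” are returned as context for the seventh word; “the contact” is left out because it has no modification relationship with it.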

Example Experiment

A model consisting of approximately 1000 sentences was prepared based on a newspaper article. An experiment for obtaining an entropy was conducted for this model using the method of this embodiment. The following results were obtained.

This Embodiment: 4.05 [bit]

tri-gram: 4.27 [bit]

The value of 4.05 [bit] in this embodiment corresponds to a selection for which 16.6 words were used, and the value of 4.27 [bit] corresponds to a selection for which 19.3 words were used. Therefore, it was confirmed that the word prediction accuracy was improved when this embodiment was used.
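The word counts quoted above are consistent with reading each entropy value as a perplexity, that is, 2 raised to the entropy in bits. This reading is an assumption, since the text does not state the formula, and the function name `perplexity` is chosen here only for illustration.

```python
def perplexity(entropy_bits: float) -> float:
    """Number of equally likely word choices equivalent to a per-word entropy in bits."""
    return 2.0 ** entropy_bits


embodiment = perplexity(4.05)  # entropy reported for this embodiment
trigram = perplexity(4.27)     # entropy reported for the tri-gram model

print(round(embodiment, 1), round(trigram, 1))
```

A lower value means that the model is, on average, choosing among fewer candidate words at each position.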

As is described above, according to the present invention, the sentence structure, to include a word to be predicted, is specified, and the prediction of the word is performed using a word or a word sequence having a modification relationship with the word to be predicted. Since the modification relationship is useful information for the word prediction, the word prediction accuracy is increased.

If not otherwise stated herein, it is to be assumed that all patents, patent applications, patent publications and other publications (including web-based publications) mentioned and cited herein are hereby fully incorporated by reference herein as if set forth in their entirety herein.

Although illustrative embodiments of the present invention have been described herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various other changes and modifications may be affected therein by one skilled in the art without departing from the scope or spirit of the invention.

1. A speech recognition method, said method comprising acts of: A) receiving a spoken phrase; B) applying an acoustic model to the phrase to select proposed words for the phrase; C) determining a structure of the phrase from a beginning of the phrase to a j-th word, wherein j is a positive integer; D) based at least in part on the determined structure of the phrase up to said j-th word, identifying one or multiple partial analysis trees modifying the (j+1)th word; E) predicting said (j+1)th word from the proposed words based on said one or multiple partial analysis trees; F) obtaining a putative sentence structure for the phrase including the predicted (j+1)th word and a probability value for said putative sentence structure; G) increasing j; H) while the j-th word is not a last word of the phrase, repeating steps C-G; I) selecting as speech recognition results for the phrase a sentence structure and a word sequence obtained in steps C-H having maximum probability values; and J) returning said speech recognition results to a user.
2. The method according to claim 1, wherein the act E further comprises: predicting said (j+1)th word based solely on said one or multiple partial analysis trees having a modification relationship with said (j+1)th word.
3. The method according to claim 1 wherein, when multiple partial analysis trees modifying the (j+1)th word are specified, the act E is performed based on said specified multiple partial analysis trees modifying the (j+1)th word.
4. The method according to claim 1 further comprising: when a modification direction between said one or multiple partial analysis trees modifying the (j+1)th word and said (j+1)th word is not constant, specifying said modification direction.
5. The method according to claim 1 wherein when multiple modifications are established between said one or multiple partial analysis trees modifying the (j+1)th word and said (j+1)th word, predicting a different (j+1)th word for each of said modifications.
6. The method according to claim 1 further comprising: utilizing a dictionary database, containing appearance frequencies of a predetermined word relative to another predetermined word and/or word sequence obtained from text data that has been learned, to determine which of multiple partial analysis trees are to be utilized for predicting said (j+1)th word.
7. The method according to claim 1, wherein the act J further comprises: displaying said speech recognition results on a display device.
8. The method according to claim 1, wherein the act J further comprises: storing said speech recognition results in an external storage medium.
9. The method according to claim 1, wherein the act J further comprises: transferring said speech recognition results to a printer; and printing said speech recognition results.
10. The method according to claim 1, wherein: the act D comprises identifying one or more partial analysis trees from the determined structure of the phrase up to the j-th word, and determining which of the one or more partial analysis trees modify the (j+1)th word; and the act E comprises predicting the (j+1)th word based only on those of the one or more partial analysis trees that are determined to modify the (j+1)th word.
11. A speech recognition apparatus comprising: an arrangement of hardware and software adapted to receive a spoken phrase; an arrangement of hardware and software adapted to apply an acoustic model to the phrase to select proposed words for the phrase; a prediction arrangement of hardware and software adapted to perform acts of: A) determining a structure of the phrase from a beginning of the phrase to a j-th word, wherein j is a positive integer; B) based at least in part on the determined structure of the phrase up to said j-th word, identifying one or multiple partial analysis trees modifying the (j+1)th word; C) predicting said (j+1)th word from the proposed words based on said one or multiple partial analysis trees; D) obtaining a putative sentence structure for the phrase including the predicted (j+1)th word and a probability value for said putative sentence structure; E) increasing j; and F) while the j-th word is not a last word of the phrase, repeating steps A-E; an arrangement of hardware and software adapted to select as speech recognition results for the phrase a sentence structure and a word sequence obtained in steps A-F having maximum probability values; and an arrangement of hardware and software adapted to return said speech recognition results to a user.
12. The apparatus according to claim 11 wherein said prediction arrangement is adapted to predict said (j+1)th word based solely on said one or multiple partial analysis trees having a modification relationship with said (j+1)th word.
13. The apparatus according to claim 11 wherein said prediction arrangement is adapted to, when multiple partial analysis trees modifying the (j+1)th word are specified, predict said (j+1)th word based on said specified multiple partial analysis trees modifying the (j+1)th word.
14. The apparatus according to claim 11 wherein said prediction arrangement is adapted to, when a modification direction between said one or multiple partial analysis trees modifying the (j+1)th word and said (j+1)th word is not constant, specify said modification direction.
15. The apparatus according to claim 11 wherein said prediction arrangement is adapted to, when multiple modifications are established between said one or multiple partial analysis trees modifying the (j+1)th word and said (j+1)th word, predict a different (j+1)th word for each of said modifications.
16. The apparatus according to claim 11 wherein said prediction arrangement is adapted to utilize a dictionary database, containing appearance frequencies of a predetermined word relative to another predetermined word and/or word sequence obtained from text data that has been learned, to determine which of multiple partial analysis trees are to be utilized for predicting said (j+1)th word.
17. The apparatus according to claim 11, further comprising: a display device which displays said speech recognition results.
18. The apparatus according to claim 11 further comprising: a storage medium which stores said speech recognition results.
19. The apparatus according to claim 11 further comprising: a printer which prints said speech recognition results.
20. The apparatus according to claim 11, wherein: the act B comprises identifying one or more partial analysis trees from the determined structure of the phrase up to the j-th word, and determining which of the one or more partial analysis trees modify the (j+1)th word; and the act C comprises predicting the (j+1)th word based only on those of the one or more partial analysis trees that are determined to modify the (j+1)th word.
21. A program storage device readable by computer, tangibly embodying a program of instructions executable by the computer to perform a method for speech recognition, said method comprising acts of: A) receiving a spoken phrase; B) applying an acoustic model to the phrase to select proposed words for the phrase; C) determining a structure of the phrase from a beginning of the phrase to a j-th word, wherein j is a positive integer; D) based at least in part on the determined structure of the phrase up to said j-th word, identifying one or multiple partial analysis trees modifying the (j+1)th word; E) predicting said (j+1)th word from the proposed words based on said one or multiple partial analysis trees; F) obtaining a putative sentence structure for the phrase including the predicted (j+1)th word and a probability value for said putative sentence structure; G) increasing j; H) while the j-th word is not a last word of the phrase, repeating steps C-G; I) selecting as speech recognition results for the phrase a sentence structure and a word sequence obtained in steps C-H having maximum probability values; and J) returning said speech recognition results to a user.
22. The program storage device according to claim 21, wherein: the act D comprises identifying one or more partial analysis trees from the determined structure of the phrase up to the j-th word, and determining which of the one or more partial analysis trees modify the (j+1)th word; and the act E comprises predicting the (j+1)th word based only on those of the one or more partial analysis trees that are determined to modify the (j+1)th word.